[Figure (a): MapReduce job diagram. Recoverable labels: HDFS file, Split, Map Task (Map, Combine), Reduce Task.]

Authors

  • Magdalena Balazinska
  • Dan Grossman
Abstract

In parallel query-processing environments, accurate, time-oriented progress indicators would be highly useful to users, because queries can take a very long time to complete and both inter- and intra-query execution times can vary widely. In these systems, query times depend not only on the query plans and the amount of data being processed, but also on the amount of parallelism available, the types of operators (often user-defined) that perform the processing, and the overall system load. None of the techniques used by existing tools or available in the literature provides a non-trivial progress indicator for parallel queries. In this paper, we introduce Parallax, the first such indicator. Several parallel data-processing systems exist; in this paper, we target environments where queries consist of a series of MapReduce jobs. Parallax builds on recently developed techniques for estimating the progress of single-site SQL queries, and it enhances and extends these techniques in non-trivial ways. We implemented our estimator in the Pig system and demonstrate its performance in experiments with the PigMix benchmark and other queries running on a real, small-scale cluster.

Similar resources

Efficient Matrix Multiplication in Hadoop

In a typical MapReduce job, each map task processes one piece of the input file. If two input matrices are stored in separate HDFS files, a single map task cannot access both input matrices at the same time. To deal with this problem, we propose an efficient matrix multiplication in Hadoop. For dense matrices, we use plain row-major order to store the matrices on HDFS; for sparse ma...

HADOOP: A Framework for Distributed Computing

With data growing so rapidly, and with unstructured data now accounting for about 90% of all data, the time has come for enterprises to re-evaluate their approach to data storage, management, and analysis. This enormously growing data has been given the name Big Data. The Hadoop platform has been designed to tackle the problems associated with handling such enormous data that doe...

Heterogeneous Multi core processors for improving the efficiency of Market basket analysis algorithm in data mining

Heterogeneous multi-core processors can offer diverse computing capabilities. The efficiency of the market basket analysis algorithm can be improved with heterogeneous multi-core processors. Market basket analysis uses the Apriori algorithm, one of the popular data mining algorithms that can use the Map/Reduce framework to perform analysis. The algorithm generates association rule...

ΩSU(n) does not Split in 2 Suspensions, for n ≥ 3

Solving a conjecture of Hopkins and Mahowald, the second author [Ri] showed that Mitchell’s [Mi3] filtration $\{F_{n,k}\}_{k=1}$ of $\Omega SU(n)$ splits stably, analogous to the Snaith [Sn2] splitting of $BU$. Crabb and Mitchell [C-M] then gave similar splittings of $\Omega U(n)/O(n)$ and $\Omega U(2n)/Sp(n)$. The first filtration $F_{n,1}$ is the inclusion $\mathbb{CP}^{n-1} \subset \Omega SU(n)$, which was actually known to split off by the work of James [...

Algebraic Topology and Distributed Computing: A Primer

they are elementary, being fully covered in the first chapter of Munkres' standard textbook [18]. Our discussion focuses on a class of problems called decision tasks, described in Section 2. In Section 3, we show how decision tasks can be modeled using simplicial complexes, a standard combinatorial structure from elementary topology. In Section 4, we review the notion of a chain complex, which pr...

Publication date: 2009